Topological AI forecasting of future dominating viral variants

Understanding the mechanisms of SARS-CoV-2 evolution and transmission is one of the greatest challenges of our time. By integrating artificial intelligence (AI), viral genomes isolated from patients, tens of thousands of mutational data points, biophysics, bioinformatics, and algebraic topology, researchers revealed that SARS-CoV-2 evolution is governed by infectivity-based natural selection. Two key mutation sites, L452 and N501 on the viral spike protein receptor-binding domain (RBD), were predicted in summer 2020, long before they occurred in the prevailing variants Alpha, Beta, Gamma, Delta, Kappa, Theta, Lambda, Mu, and Omicron. Recent studies identified a new mechanism of natural selection: antibody resistance. AI-based forecasting of Omicron’s infectivity, vaccine breakthrough, and antibody resistance was later nearly perfectly confirmed by experiments. The replacement of dominant BA.1 by BA.2 in late March was predicted in early February. On May 1, 2022, persistent Laplacian-based AI projected Omicron BA.4 and BA.5 to become the new dominating COVID-19 variants. This prediction became reality in late June. Topological AI models offer accurate prediction of mutational impacts on the efficacy of monoclonal antibodies (mAbs).

SARS-CoV-2 is an extremely sophisticated virus with 29 different proteins. This number includes the spike protein, which enables viral cell entry through its interaction with human angiotensin-converting enzyme 2 (hACE2) at the virus spike receptor-binding domain (RBD). The strength of this interaction is proportional to virus infectivity [8]. Mutations may occur randomly, but natural selection favors RBD mutations that strengthen viral infectivity and evolutionary fitness [4]. For example, approximately 32 of the 50 mutations of COVID-19's Omicron variant are located on the spike protein; disproportionately, 15 of them are on the RBD to optimize viral evolutionary advantages [3].
The spike protein is also the main antigenic target of COVID-19 antibodies that are generated by either infection or vaccination. Spike protein-bound antibodies prevent the SARS-CoV-2 virus from interacting with hACE2 and subsequently block viral cell entry, while antibodies that compete with hACE2 on the spike RBD can directly neutralize the virus. The non-covalent binding between a viral spike protein and an antibody works like a zipper, with the virus spike RBD acting as the zipper's upper teeth and the antibody serving as the lower teeth. The RBD mutations cause the zipper to misalign or break off, which leads to the (partial) loss of antibody protection and potential reinfection. In some cases, one or several of the vital RBD mutations can significantly enhance RBD-hACE2 binding and dramatically disrupt the binding between the spike RBD and protective antibodies.
Forecasting the emerging dominant variants helps policymakers plan preventive measures and allows biopharmaceutical companies additional time to develop future vaccines and antibody drugs. However, such forecasting is one of the most challenging scientific tasks of our time. Identifying the mutations that are vital for virus evolution involves striking complexity; a spike protein consists of more than 1,700 amino acid residues, and its dynamical degrees of freedom exceed 5,100. In contrast, the million-dollar Navier-Stokes existence and smoothness problem concerns only three-dimensional dynamics. Additionally, each residue can mutate into one of 19 alternative amino acids with a wide range of chemical, physical, and biological disparities. This possibility creates an astronomically large mutational space that scales as 20^N (where N is the number of involved amino acid residues), thus making full experimental deep mutational screening infeasible. Moreover, each set of mutations may potentially contribute to a new viral variant. The genotype-phenotype mapping between a variant and its infectivity and/or antibody resistance is highly nonlinear and involves intricate geometric and combinatorial complexities. Innovative strategies are therefore necessary.
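To give a feel for the 20^N scaling, the following minimal Python sketch counts sequences over N sites and the variants that differ from a reference at exactly k sites. The function names are hypothetical, and the choice N = 15 is only an illustration borrowed from the number of Omicron RBD mutations mentioned earlier; it is not a computation taken from the cited studies.

```python
from math import comb


def sequence_space(n_sites: int, alphabet: int = 20) -> int:
    """Number of distinct sequences over n_sites positions (alphabet ** n_sites)."""
    return alphabet ** n_sites


def variants_with_k_substitutions(n_sites: int, k: int, alternatives: int = 19) -> int:
    """Variants that differ from a reference sequence at exactly k of n_sites positions.

    Choose which k sites change, then one of 19 alternative residues at each.
    """
    return comb(n_sites, k) * alternatives ** k


if __name__ == "__main__":
    n = 15  # illustrative assumption: 15 mutable RBD sites
    print("single-site variants:", variants_with_k_substitutions(n, 1))
    print("double-site variants:", variants_with_k_substitutions(n, 2))
    print("digits in 20^N:", len(str(sequence_space(n))))
```

Summing the k-substitution counts over k = 0, ..., N recovers 20^N by the binomial theorem, which is why even a modest number of sites rules out exhaustive experimental screening.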
Topology offers a solution to this intriguing problem (see Figure 1). Traditional topology addresses the invariants of a geometric object under continuous deformation; the homeomorphisms and homotopies may not refer to metrics or coordinates and are too abstract to find use in biological analysis. However, persistent homology, a new branch of algebraic topology that employs multiscale analysis to bridge the gap between complex geometry and abstract topology [1,6], effectively simplifies biomolecular complexity. Persistent homology analyzes biomolecular data in terms of a simplicial complex. The underlying topological space is equipped with a filtration to create a family of simplicial complexes, which is a nested sequence of multiscale subsets. But biomolecular systems involve a wide range of interactions (such as covalent bonds, hydrogen bonds, van der Waals interactions, electrostatics, hydrophilicity, and hydrophobicity) that a persistent homology analysis of the whole molecule would miss. Element-specific persistent homology (ESPH) overcomes this obstacle by embedding physical, chemical, and biological information in topological invariants. The power of ESPH was exemplified through its dominating victories in worldwide annual competitions in computer-aided drug design, which is one of the most competitive fields in modern science [7]. It remains to be seen whether this topological tool can withstand the sudden challenges that are associated with the ongoing COVID-19 pandemic.
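The filtration idea can be made concrete with a small Python sketch. It computes only the dimension-0 barcode (connected components) of a Vietoris-Rips filtration using a union-find structure, and it mimics the element-specific idea by restricting the point cloud to chosen element types. The function names, the toy atoms, and their coordinates are all hypothetical; this is a simplified illustration under those assumptions, not the ESPH pipeline of the cited work, which also uses higher-dimensional invariants.

```python
import itertools
import math


def h0_barcode(points):
    """Finite dimension-0 bars (birth, death) of a Vietoris-Rips filtration.

    Every point is born at scale 0. A component dies at the length of the
    edge that first merges it into another component.
    """
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Edges enter the filtration in order of increasing length.
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in itertools.combinations(range(n), 2)
    )
    bars = []
    for length, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j
            bars.append((0.0, length))
    return bars  # n - 1 finite bars; one component persists forever


def element_specific_h0(atoms, elements):
    """Barcode of the sub-cloud made of atoms whose element is in `elements`."""
    selected = [xyz for element, xyz in atoms if element in elements]
    return h0_barcode(selected)


# Hypothetical toy "molecule": (element, coordinates), values invented for illustration.
atoms = [
    ("C", (0.0, 0.0, 0.0)),
    ("C", (1.5, 0.0, 0.0)),
    ("N", (0.0, 1.4, 0.0)),
    ("O", (4.0, 0.0, 0.0)),
    ("C", (5.2, 0.0, 0.0)),
]

if __name__ == "__main__":
    print("all atoms:  ", h0_barcode([xyz for _, xyz in atoms]))
    print("carbon only:", element_specific_h0(atoms, {"C"}))
```

Restricting to carbon atoms changes the barcode because the N and O atoms no longer bridge the carbons, which is the sense in which element selection encodes chemical information in the topological summary.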
Right before the pandemic began, researchers developed an ESPH-based deep learning method that offers state-of-the-art predictions of mutation-induced binding affinity changes of protein-protein interactions, including antibody-antigen interactions [9]. Scientists integrated this method with genotyping and sequence alignment to create an artificial intelligence (AI) platform, which revealed that SARS-CoV-2 evolution and transmission follow Darwin's natural selection [4]. The study singled out two vital spike protein residues at positions 452 and 501 that "have very high chances to mutate into significantly more infectious COVID-19 strains" long before they occurred in prevailing SARS-CoV-2 variants, such as Alpha, Beta, Gamma, Delta, Epsilon, Theta, Kappa, Lambda, Mu, and Omicron. During this process, ESPH delineates the crucial geometric and biophysical characteristics of mutants in the astronomically large topological space that contribute to virus infectivity, vaccine breakthrough, and antibody resistance.
When reports of the new Omicron variant first emerged in late November 2021, no relevant data was available because experiments had not yet been performed. But within a few days, a topology-based AI platform forecasted Omicron to be nearly three times more infectious than Delta, capable of escaping nearly 90 percent of vaccines, and resistant to essentially all U.S. Food and Drug Administration-approved monoclonal antibodies. Experiments confirmed these predictions in the following weeks [3]. In early 2022, a new subvariant of Omicron called BA.2 started spreading. On February 11, the same topology-based AI platform forecasted BA.2 as the next dominant variant [5]. Six weeks later on March 26, the World Health Organization announced BA.2's global dominance. Further Omicron subvariants, including BA.4 and BA.5, subsequently emerged (see Figure 2). These subvariants involve numerous spike protein RBD mutations with very subtle differences and therefore demand more discriminative mathematical tools. Although persistent homology is an outstanding tool for the characterization of topological invariants, it is insensitive to the homotopic shape variations in protein-protein interactions that are crucial to viral evolution and transmission. A recent study tackled this challenge with the persistent Laplacian (also known as the persistent spectral graph): a topological Laplacian that is designed to capture both the topological persistence and homotopical shape evolution of data [10]. Its harmonic spectra fully recover the topological invariants of persistent homology, while its nonharmonic spectra unveil homotopical shape evolution. On May 1, 2022, persistent Laplacian-based AI projected Omicron BA.4 and BA.5 to become the new dominating COVID-19 variants [2]. This prediction became reality in late June.
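The distinction between harmonic and nonharmonic spectra can be illustrated in its simplest setting with a short Python sketch that assumes NumPy is available. It builds the ordinary graph Laplacian (the dimension-0 combinatorial Laplacian) of a point cloud at a single filtration scale, counts zero eigenvalues (the number of connected components), and reports the smallest positive eigenvalue. This is a deliberately reduced stand-in: the persistent Laplacian of [10] is defined across pairs of filtration scales and in higher dimensions, and the function names and toy point sets here are hypothetical.

```python
import numpy as np


def graph_laplacian(points, scale):
    """Graph Laplacian D - A, joining points whose distance is at most `scale`."""
    pts = np.asarray(points, dtype=float)
    dists = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    adjacency = (dists <= scale) & ~np.eye(len(pts), dtype=bool)
    adjacency = adjacency.astype(float)
    return np.diag(adjacency.sum(axis=1)) - adjacency


def spectral_summary(points, scale, tol=1e-9):
    """Return (number of zero eigenvalues, smallest positive eigenvalue or None)."""
    eigenvalues = np.linalg.eigvalsh(graph_laplacian(points, scale))
    harmonic = int(np.sum(eigenvalues < tol))      # multiplicity of eigenvalue 0
    positive = eigenvalues[eigenvalues >= tol]     # nonharmonic part of the spectrum
    smallest_nonzero = float(positive.min()) if positive.size else None
    return harmonic, smallest_nonzero


# Two toy shapes with the same number of components at scale 1.1.
line = [(0, 0), (1, 0), (2, 0), (3, 0)]    # path on four points
square = [(0, 0), (1, 0), (1, 1), (0, 1)]  # cycle on four points

if __name__ == "__main__":
    for name, cloud in (("line", line), ("square", square)):
        for scale in (0.5, 1.1):
            print(name, scale, spectral_summary(cloud, scale))
```

At scale 1.1 both shapes have a single connected component, so the zero-eigenvalue count cannot tell them apart, yet their smallest positive eigenvalues differ. That gap is the kind of shape information the nonharmonic spectra add on top of the topological invariants.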